TECHNICAL  REPORT  TR- 10-03,  UC  DAVIS,  OCTOBER,  2010. 




Learning in A Changing World: Non-Bayesian Restless Multi-Armed Bandit

Haoyang  Liu,  Keqin  Liu,  Qing  Zhao 
University  of  California,  Davis,  CA  95616 
{liu,  kqliu,  qzhao}@ucdavis.edu 


Abstract 

We  consider  the  restless  multi-armed  bandit  (RMAB)  problem  with  unknown  dynamics.  In  this 
problem,  at  each  time,  a  player  chooses  K  out  of  N  (N  >  K)  arms  to  play.  The  state  of  each  arm 
determines the reward when the arm is played and transits according to Markovian rules regardless of
whether the arm is engaged or passive. The Markovian dynamics of the arms are unknown to the player. The objective
is  to  maximize  the  long-term  reward  by  designing  an  optimal  arm  selection  policy.  The  performance 
of  a  policy  is  measured  by  regret,  defined  as  the  reward  loss  with  respect  to  the  case  where  the  player 
knows  which  K  arms  are  the  most  rewarding  and  always  plays  these  K  best  arms.  We  construct  a 
policy,  referred  to  as  Restless  Upper  Confidence  Bound  (RUCB),  that  achieves  a  regret  with  logarithmic 
order  of  time  when  an  arbitrary  nontrivial  bound  on  certain  system  parameters  is  known.  When  no 
knowledge  about  the  system  is  available,  we  extend  the  RUCB  policy  to  achieve  a  regret  arbitrarily 
close  to  the  logarithmic  order.  In  both  cases,  the  system  achieves  the  maximum  mean  reward  offered  by 
the  K  best  arms.  Potential  applications  of  these  results  include  cognitive  radio  networks,  opportunistic 
communications  in  unknown  fading  environments,  and  financial  investment. 

Index  Terms 

Restless  multi-armed  bandit,  non-Bayesian  formulation,  regret,  logarithmic  order 

I.  Introduction 

The Restless Multi-Armed Bandit (RMAB) problem is a generalization of the classic Multi-Armed
Bandit (MAB) problem. In the classic MAB, there are N independent arms and a single
player. At each time, the player chooses one arm to play and receives a certain amount of reward.
The reward (i.e., the state) of each arm evolves as an i.i.d. process over successive plays. The
reward distribution of each arm is unknown to the player. The objective is to maximize the long-term
reward by designing an optimal arm selection policy. This problem involves the well-known
dilemma between exploitation and exploration. For exploitation, the player tends to select the
arm suggested by past reward observations as the best. For exploration, the player selects an arm
to learn its reward statistics. Under the non-Bayesian formulation, the performance measure of
an  arm  selection  policy  is  given  by  regret,  defined  as  the  reward  loss  compared  with  the  optimal 
performance  in  the  ideal  scenario  of  a  known  reward  model  [1].  Note  that  in  the  ideal  scenario, 
the  player  will  always  play  the  arm  with  the  highest  mean  reward.  The  essence  of  the  problem 
is  to  identify  the  best  arm  without  engaging  other  inferior  arms  too  often. 

In 1985, Lai and Robbins showed that the minimum regret grows with time in a logarithmic
order [1]. A policy was further constructed to achieve the minimum regret (both the logarithmic
order and the best leading constant) [1]. In 1987, Anantharam et al. extended Lai and Robbins's
results to accommodate multiple simultaneous plays [2] and a Markovian reward model where the
reward  of  each  arm  evolves  as  an  unknown  Markov  process  over  successive  plays  and  remains 
frozen  when  the  arm  is  not  played  (the  so-called  rested  Markovian  reward  model)  [3].  For  both 
extensions,  the  minimum  regret  growth  rate  has  been  shown  to  be  logarithmic  [2],  [3].  There 
are  also  several  simpler  index  policies  that  achieve  logarithmic  regret  for  the  classic  MAB  under 
an  i.i.d.  reward  model  [4],  [5].  In  particular,  the  index  policy — referred  to  as  Upper  Confidence 
Bound  1  (UCB1) — proposed  in  [5]  achieves  the  logarithmic  regret  with  a  uniform  bound  on  the 
leading  constant  over  time.  In  [6],  UCB1  was  extended  to  the  rested  Markovian  reward  model 
adopted  in  [3]. 

A.  Restless  Multi-Armed  Bandit  with  Unknown  Dynamics 

Different  from  the  classic  MAB,  in  an  RMAB,  the  state  of  each  arm  can  change  (according 
to  an  unknown  Markovian  rule)  even  when  the  arm  is  not  played.  The  unknown  state  transition 
matrix  when  the  arm  is  played  can  be  different  from  that  when  it  is  not  played.  We  consider 
the  general  case  where  K  (K  <  N)  arms  are  simultaneously  played  at  each  time.  Even  with  a 
known model, the RMAB problem has been shown to be PSPACE-hard in general [7].

In this paper, we address the RMAB problem with unknown Markovian dynamics. Similar to
the classic MAB, we measure the performance of a policy by regret, defined as the reward loss
compared  to  the  case  when  the  player  knows  which  K  arms  are  most  rewarding  and  always 
plays  the  K  best  arms.  We  show  that  for  RMAB,  logarithmic  regret  can  also  be  achieved  as  in 
the  classic  MAB.  Specifically,  we  construct  a  policy  that  achieves  logarithmic  regret  when  an 
arbitrary  nontrivial  bound  on  certain  system  parameters  is  known.  When  no  knowledge  about 
the  system  is  available,  we  show  that  a  variation  of  the  policy  achieves  a  regret  arbitrarily  close 
to logarithmic order, i.e., the regret has order $f(t)\log(t)$ for any increasing function $f(t)$ with
$f(t) \to \infty$ as time $t \to \infty$. In both cases, the proposed policy achieves the maximum mean
reward  offered  by  the  K  best  arms. 

Referred  to  as  the  Restless  Upper  Confidence  Bound  (RUCB),  the  proposed  policy  borrows  the 
basic  index  form  of  the  UCB-1  policy  developed  in  [5]  for  the  classic  MAB  under  i.i.d.  reward 
models.  To  handle  the  restless  nature  of  the  problem,  the  basic  structure  of  the  proposed  RUCB 
policy  is  fundamentally  different  from  that  of  UCB-1.  Specifically,  the  basic  structure  of  RUCB 
consists  of  interleaving  exploitation  and  exploration  epochs  with  carefully  controlled  lengths 
to  bound  the  frequency  of  arm  switching  and  balance  the  tradeoff  between  exploitation  and 
exploration. Another novelty of this paper is a general technique for choosing policy parameters
whose values may have to depend on the range of certain system parameters. We show that by
letting these policy parameters grow with time (rather than fixing them a priori), one can circumvent
the dependency of the policy parameters on system parameters and achieve a regret order
arbitrarily close to logarithmic without any knowledge about the system.

We  point  out  that  the  definition  of  regret  adopted  in  this  paper,  while  similar  to  that  used  for 
the  classic  MAB,  is  a  weaker  version  of  its  counterpart  in  the  classic  MAB.  In  the  classic  MAB 
with  either  i.i.d.  or  rested  Markovian  reward,  the  optimal  policy  under  known  model  is  to  stay 
with  the  best  arm  in  terms  of  the  reward  mean.  For  RMAB,  however,  the  optimal  policy  under 
known  model  is  no  longer  given  by  staying  with  the  arm  with  the  highest  mean  reward.  Defining 
the  regret  in  terms  of  this  optimal  policy  would  require  that  a  general  RMAB  with  known  model 
be  solved  and  optimal  performance  analyzed  before  the  regret  under  unknown  model  can  be 
approached.  Unfortunately,  RMAB  under  known  model  itself  is  intractable  in  general  [7].  In 
this paper, we adopt a weaker definition of regret where the performance is compared with a
“partially-informed” genie who knows only which K arms have the highest mean reward instead
of the complete system dynamics. This definition of regret leads to a tractable problem, but at
the same time, weaker results. Whether stronger results for a general RMAB under unknown
model  can  be  obtained  is  still  open  for  exploration  (see  more  discussions  in  Sec.  I-C  on  related 
work). 

B.  Applications 

The  restless  multi-armed  bandit  problem  has  a  broad  range  of  applications.  For  example,  in  a 
cognitive  radio  network,  a  secondary  user  searches  among  several  channels  for  idle  slots  that  are 
temporarily  unused  by  primary  users.  The  state  of  each  channel  (busy  or  idle)  can  be  modeled 
as  a  two-state  Markov  chain.  At  each  time,  a  secondary  user  chooses  one  channel  to  sense  and 
subsequently  transmit  if  the  channel  is  found  in  the  idle  state.  The  objective  of  the  secondary 
user  is  to  maximize  the  long-term  throughput  by  designing  an  optimal  channel  selection  policy 
without  knowing  the  traffic  dynamics  of  the  primary  users. 
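
As a small illustration of the channel model described above, the following sketch simulates a single two-state (busy/idle) Markov channel and compares the empirical idle fraction with the stationary idle probability. The function names and the transition probabilities are hypothetical choices made for this example; they are not taken from the report.

```python
import random


def stationary_idle_probability(p_busy_to_idle, p_idle_to_busy):
    """Stationary probability of the idle state of a two-state Markov chain."""
    return p_busy_to_idle / (p_busy_to_idle + p_idle_to_busy)


def simulate_idle_fraction(p_busy_to_idle, p_idle_to_busy, num_slots, seed=0):
    """Simulate the channel and return the fraction of idle slots."""
    rng = random.Random(seed)
    idle = True
    idle_slots = 0
    for _ in range(num_slots):
        idle_slots += idle
        if idle:
            # Stay idle unless the primary user starts transmitting.
            idle = rng.random() >= p_idle_to_busy
        else:
            # Become idle when the primary user stops transmitting.
            idle = rng.random() < p_busy_to_idle
    return idle_slots / num_slots
```

A secondary user that does not know these transition probabilities only sees the realized busy/idle states of the channels it senses, which is the learning problem studied here.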

Consider  opportunistic  transmission  over  multiple  wireless  channels  with  unknown  Markovian 
fading.  In  each  slot,  a  user  senses  the  fading  realization  of  a  selected  channel  and  chooses  its 
transmission power or data rate accordingly. The reward can model energy efficiency (for fixed-rate
transmission) or throughput. The objective is to design the optimal channel selection policies
under  unknown  fading  dynamics. 

Another potential application is financial investment, where a Venture Capital (VC) firm selects
one company to invest in each year. The state (e.g., annual profit) of each company evolves as a
Markov chain with the transition matrix depending on whether the company is invested in or not.
The  objective  of  the  VC  is  to  maximize  the  long-run  profit  by  designing  the  optimal  investment 
strategy  without  knowing  the  market  dynamics  a  priori. 

The proposed policy for RMAB also provides a basic building block for constructing decentralized
policies for MAB with multiple distributed players under a Markovian reward model [8]
(decentralized MAB was first formulated and solved under an i.i.d. reward model in [9]). In the
decentralized  MAB  with  Markovian  reward,  multiple  distributed  players  select  arms  to  play  and 
collide  when  they  select  the  same  arm.  Arms  are  rested,  i.e.,  they  do  not  change  states  when  they 
are  not  played.  However,  from  each  player’s  point  of  view,  each  arm  is  restless  since  its  state 
can  be  changed  by  other  players.  Applying  the  RUCB  policy  proposed  here  to  the  decentralized 
rested  multi-armed  bandit  problem  leads  to  the  optimal  logarithmic  order  of  the  regret  [8]. 

C. Related Work

This paper is among the first attempts to address RMAB under unknown models. There are two
parallel independent investigations reported in [10] and [11]. In [10], Tekin and Liu adopted the
same definition of regret as used in this paper and proposed a policy that achieves logarithmic
(weak) regret when certain knowledge about the system parameters is available. The policy
proposed  in  [10]  also  uses  the  index  form  of  UCB-1  given  in  [5],  but  the  structure  is  different 
from  RUCB  proposed  in  this  paper.  In  [11],  a  stronger  definition  of  regret  is  adopted,  where 
regret  is  defined  as  reward  loss  with  respect  to  the  optimal  performance  in  the  ideal  scenario  of 
known  reward  model.  However,  the  problem  can  only  be  solved  for  a  special  class  of  RMAB. 
Specifically,  when  arms  are  governed  by  stochastically  identical  two-state  Markov  chains,  a 
policy  was  constructed  in  [11]  to  achieve  a  regret  with  an  order  arbitrarily  close  to  logarithmic. 

The  RMAB  with  known  reward  model  has  been  extensively  studied  in  the  literature.  In  [12], 
Whittle proposed a heuristic index policy that generalizes Gittins' optimal index policy for
the classic MAB with known reward model [13]. Weber showed that Whittle's index policy is
asymptotically optimal (as the number of arms goes to infinity) under certain conditions [14].
In the finite regime, the optimality of Whittle's index policy has been shown for certain special
families  of  RMAB  (see,  for  example,  [15]). 

II.  Problem  Formulation 

In the RMAB problem, we have one player and N independent arms. At each time, the player
can choose K (K < N) arms to play (we focus on K = 1 for simplicity of presentation).
Each arm, when played (activated), offers a certain amount of reward that models the current state
of the arm. Let $s_j(t)$ denote the state of arm j at time t. Whether or not an arm is played, the
state of the arm changes according to a Markovian rule. In general, the transition matrices in
the  active  mode  and  the  passive  mode  are  not  necessarily  the  same.  The  player  does  not  know 
the  transition  matrices  of  the  arms.  The  objective  is  to  choose  one  arm  to  play  at  each  time  in 
order  to  maximize  the  expected  total  reward  collected  in  the  long  run. 

Let $\mathcal{S}_j$ denote the state space of arm j. Each arm is assumed to have a finite state space.
Different arms can have different state spaces. Let $P_j$ denote the active transition matrix of arm
j and $Q_j$ the passive transition matrix. All transition matrices are assumed to be irreducible,
aperiodic, and reversible. Let $\vec{\pi}_j = \{\pi^j_s\}_{s \in \mathcal{S}_j}$ denote the stationary distribution of arm j in the
active mode (i.e., under $P_j$), where $\pi^j_s$ is the stationary probability (under $P_j$) that arm j is in
state s. The stationary mean reward $\mu_j$ is given by $\mu_j = \sum_{s \in \mathcal{S}_j} s \pi^j_s$. Let $\sigma$ be a permutation of
$\{1, \cdots, N\}$ such that

$$\mu_{\sigma(1)} \ge \mu_{\sigma(2)} \ge \mu_{\sigma(3)} \ge \cdots \ge \mu_{\sigma(N)}.$$

A policy $\Phi$ is a rule that specifies the arm to play based on the observation history. Let $t_j(n)$
denote the time index of the nth play on arm j, and $T_j(t)$ the total number of plays on arm j
by time t. Notice that both $t_j(n)$ and $T_j(t)$ are random variables with distributions determined
by the policy $\Phi$. The total reward by time t is given by

$$R(t) = \sum_{j=1}^{N} \sum_{n=1}^{T_j(t)} s_j(t_j(n)). \qquad (1)$$

As mentioned in Sec. I, the regret $r_\Phi(t)$ achieved by policy $\Phi$ is defined as the reward loss
with respect to the case where the player knows which arm has the highest mean reward and
always plays this best arm. We thus have

$$r_\Phi(t) = t\mu_{\sigma(1)} - E_\Phi R(t), \qquad (2)$$

where $E_\Phi$ denotes the expectation with respect to the random process induced by policy $\Phi$. The
objective is to minimize the growth rate of the regret.
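
To make the quantities in this formulation concrete, the sketch below computes the stationary distribution of an active transition matrix by power iteration, the stationary mean reward of an arm, and the regret for a given expected total reward. The function names and the toy matrix are illustrative assumptions, and power iteration is simply one convenient way to obtain the stationary distribution of an irreducible, aperiodic chain.

```python
def stationary_distribution(P, iterations=2000):
    """Approximate the stationary distribution of a row-stochastic matrix P
    (given as nested lists) by repeatedly applying pi <- pi P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[r] * P[r][c] for r in range(n)) for c in range(n)]
    return pi


def stationary_mean_reward(P, states):
    """Stationary mean reward: sum over states s of s * pi_s."""
    pi = stationary_distribution(P)
    return sum(s * p for s, p in zip(states, pi))


def regret(best_mean, t, expected_total_reward):
    """Regret by time t: t * (highest mean reward) - expected total reward."""
    return t * best_mean - expected_total_reward
```

For example, an arm with states 1 and 2 and active transition matrix [[0.9, 0.1], [0.3, 0.7]] has stationary distribution (0.75, 0.25) and stationary mean reward 1.25.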

III.  The  RUCB  Policy 

The  proposed  policy  RUCB  is  based  on  an  epoch  structure.  We  divide  the  time  into  disjoint 
epochs.  There  are  two  types  of  epochs:  exploitation  epochs  and  exploration  epochs  (see  an 
illustration in Fig. 1). In the exploitation epochs, the player calculates indexes of all arms and
plays the arm with the highest index, which is believed to be the best arm. In the exploration
epochs, the player obtains information about all arms by playing them equally many times. The
purpose of the exploration epochs is to make decisions in the exploitation epochs sufficiently
accurate. As shown in Fig. 1, in the nth exploration epoch, the player plays every arm $4^{n-1}$
times. At the beginning of the nth exploitation epoch, the player calculates the index of every arm
(see (4) in Fig. 2) and selects the arm with the highest index (denoted as arm a*). The player
keeps playing arm a* until the end of this epoch, which has length $2 \times 4^{n-1}$. How the two types
of epochs interleave is detailed in Step 2 in Fig. 2. Specifically, whenever sufficiently many
($D \ln t$, see (3)) observations have been obtained from every arm in the exploration epochs,
the  player  is  ready  to  proceed  with  a  new  exploitation  epoch.  Otherwise,  another  exploration 
epoch  is  required  to  gain  more  information  about  each  arm.  It  is  also  implied  in  (3)  that  only 
logarithmically  many  plays  are  spent  in  the  exploration  epochs,  which  is  one  of  the  key  reasons 
for  the  logarithmic  regret  of  RUCB.  This  also  implies  that  the  exploration  epochs  are  much  less 
frequent  than  the  exploitation  epochs.  Though  the  exploration  epochs  can  be  understood  as  the 
“information  gathering”  phase,  and  the  exploitation  epochs  as  the  “information  utilization”  phase, 
observations  obtained  in  the  exploitation  epochs  are  also  used  in  learning  the  arm  dynamics.  This 
can  be  seen  in  Step  3  in  Fig.  2.  In  calculating  the  indexes  using  (4),  observations  from  both  the 
exploration  and  exploitation  epochs  are  used.  This  is  different  from  the  policy  in  [10],  which 
only  uses  part  of  the  past  observations  in  calculating  indexes.  A  complete  description  of  the 
proposed  policy  is  given  in  Fig.  2. 


[Figure 1. General structure of RUCB: exploration epochs interleaved with exploitation epochs. Structure of the nth exploration epoch: arm 1 is played in slots 1 to $4^{n-1}$, arm 2 in the next $4^{n-1}$ slots, and so on, up to arm N in slots $(N-1) \times 4^{n-1} + 1$ to $N \times 4^{n-1}$. Structure of the nth exploitation epoch: the indexes are computed, the arm with the highest index (arm a*) is identified, and this arm is played from slot 1 to slot $2 \times 4^{n-1}$.]

Fig. 1. Epoch structures of RUCB

Time is divided into epochs. There are two types of epochs: exploration epochs and exploitation epochs. At
the beginning of the nth exploitation epoch, we choose one arm to play $2 \times 4^{n-1}$ times. In the
nth exploration epoch, we play every arm $4^{n-1}$ times. Let $n_O(t)$ denote the number of exploration
epochs played by time t and $n_I(t)$ the number of exploitation epochs played by time t.

1. At t = 1, we start the first exploration epoch, in which every arm is played once. We set $n_O(N+1) = 1$,
$n_I(N+1) = 0$. Then go to Step 2.

2. Let $X_1(t) = (4^{n_O} - 1)/3$ be the time spent on each arm in exploration epochs by time t. Choose
D according to (5)(6). If

$$X_1(t) > D \ln t, \qquad (3)$$

go to Step 3 (start an exploitation epoch). Otherwise, go to Step 4 (start an exploration epoch).

3. Calculate indexes $r_{i,t}$ for all arms using the formula below:

$$r_{i,t} = \bar{s}_i(t) + \sqrt{\frac{L \ln t}{X_i(t)}}, \qquad (4)$$

where t is the current time, $\bar{s}_i(t)$ is the sample mean from arm i by time t, L is chosen according
to (5), and $X_i(t)$ is the number of times we have played arm i by time t. Then choose the arm with
the highest index and play it for $2 \times 4^{n-1}$ slots, where $n = n_I + 1$ is the index of this exploitation
epoch. Increase $n_I$ by one. Go to Step 2.

4. Play each arm for $4^{n-1}$ slots, where $n = n_O + 1$ is the index of this exploration epoch. Increase
$n_O$ by one. Go to Step 2.


Fig. 2. RUCB policy
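
The steps of Fig. 2 can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MarkovArm` class and the `run_rucb` function are hypothetical names, each arm uses a single transition matrix in both the active and the passive mode, the horizon simply truncates the last epoch, and the values of L and D are passed in rather than derived from (5) and (6).

```python
import math
import random


class MarkovArm:
    """Hypothetical restless arm. For simplicity the same transition matrix is
    used whether or not the arm is played (an assumption of this sketch)."""

    def __init__(self, states, transition, rng):
        self.states = states          # reward offered in each state
        self.transition = transition  # row-stochastic matrix as nested lists
        self.rng = rng
        self.current = 0              # index of the current state

    def reward(self):
        return self.states[self.current]

    def step(self):
        row = self.transition[self.current]
        self.current = self.rng.choices(range(len(row)), weights=row)[0]


def run_rucb(arms, horizon, L, D):
    """Run the epoch schedule for `horizon` slots; return plays per arm."""
    num_arms = len(arms)
    reward_sum = [0.0] * num_arms
    plays = [0] * num_arms
    n_O = n_I = 0   # completed exploration / exploitation epochs
    t = 1           # index of the next slot

    def play(arm_index, slots):
        nonlocal t
        for _ in range(slots):
            if t > horizon:
                return
            reward_sum[arm_index] += arms[arm_index].reward()
            plays[arm_index] += 1
            for arm in arms:      # restless: every arm moves in every slot
                arm.step()
            t += 1

    while t <= horizon:
        explored = (4 ** n_O - 1) / 3   # exploration plays per arm so far
        if n_O > 0 and explored > D * math.log(t):
            # Step 3: index = sample mean + sqrt(L ln t / number of plays)
            index = [reward_sum[i] / plays[i]
                     + math.sqrt(L * math.log(t) / plays[i])
                     for i in range(num_arms)]
            best = max(range(num_arms), key=lambda i: index[i])
            play(best, 2 * 4 ** n_I)
            n_I += 1
        else:
            # Step 4: play every arm equally often
            for i in range(num_arms):
                play(i, 4 ** n_O)
            n_O += 1
    return plays
```

Note that the index in Step 3 uses all plays of an arm, from both exploration and exploitation epochs, which matches the description in the text.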


IV.  The  Logarithmic  Regret  of  RUCB 


In  this  section,  we  show  that  the  regret  achieved  by  the  RUCB  policy  has  a  logarithmic  order. 
This  is  given  in  the  following  theorem. 

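Theorem 1: Assume all arms are modeled as finite state, irreducible, aperiodic, and reversible
Markov chains. All the states (rewards) are positive. Let $\pi_{\min} = \min_{s \in \mathcal{S}_i, 1 \le i \le N} \pi^i_s$,
$\epsilon_{\max} = \max_{1 \le i \le N} \epsilon_i$, $\epsilon_{\min} = \min_{1 \le i \le N} \epsilon_i$, $s_{\max} = \max_{s \in \mathcal{S}_i, 1 \le i \le N} s$,
$s_{\min} = \min_{s \in \mathcal{S}_i, 1 \le i \le N} s$, and $|\mathcal{S}|_{\max} = \max_{1 \le i \le N} |\mathcal{S}_i|$, where $\epsilon_i$ is the eigenvalue gap of $P_i$,
i.e., one minus the second largest eigenvalue of $P_i$. Let M < N denote the number
of optimal arms. Set the policy parameters L and D to satisfy the following conditions:

$$L \ge \frac{1}{\epsilon_{\min}} \left( 4 \cdot \frac{20 s_{\max}^2 |\mathcal{S}|_{\max}^2}{3 - 2\sqrt{2}} + 10 s_{\max}^2 \right), \qquad (5)$$

$$D \ge \frac{4L}{(\mu^* - \mu_{\sigma(M+1)})^2}. \qquad (6)$$

The regret of RUCB at the end of any epoch can be upper bounded by

$$\begin{aligned}
r_\Phi(t) \le{} & \left\lceil \log_4\left(\tfrac{3}{2}(t - N) + 1\right) \right\rceil \max_i A_i \\
& + N \left( \lfloor \log_4(3D \ln t + 1) \rfloor + 1 \right) \max_i A_i \\
& + \sum_i (\mu^* - \mu_i) \left\lceil \log_4\left(\tfrac{3}{2}(t - N) + 1\right) \right\rceil \frac{3(|\mathcal{S}_i| + |\mathcal{S}^*|)}{\pi_{\min}} \left( 1 + \frac{\epsilon_{\max} \sqrt{L}}{10 s_{\min}} \right) \\
& + \sum_i (\mu^* - \mu_i) \tfrac{1}{3} \left[ 4(3D \ln t + 1) - 1 \right], \qquad (7)
\end{aligned}$$

where $A_i = (\min_{s \in \mathcal{S}_i} \pi^i_s)^{-1} \sum_{s \in \mathcal{S}_i} s$, and $\mu^*$ and $\mathcal{S}^*$ denote the mean reward and the state
space of an optimal arm.

Proof: See Appendix A for details. ■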
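
As a rough illustration of how the policy parameters could be set from known bounds, the sketch below evaluates conditions (5) and (6) with equality. It assumes that (5) reads L >= (1/eps_min)(4 * 20 s_max^2 |S|_max^2 / (3 - 2 sqrt(2)) + 10 s_max^2) and that (6) reads D >= 4L / gap^2, where gap stands for the difference between the best mean reward and the best suboptimal mean reward. The function name and argument names are hypothetical.

```python
import math


def rucb_parameters(eps_min, s_max, max_states, gap):
    """Smallest L and D allowed by conditions (5) and (6), as read above.

    eps_min    : lower bound on the eigenvalue gaps of the arms
    s_max      : upper bound on the rewards (states)
    max_states : upper bound on the size of the state spaces
    gap        : lower bound on mu* - mu_sigma(M+1)
    """
    L = (4 * 20 * s_max ** 2 * max_states ** 2 / (3 - 2 * math.sqrt(2))
         + 10 * s_max ** 2) / eps_min
    D = 4 * L / gap ** 2
    return L, D
```

Because only bounds on these system parameters are needed, any larger values of L and D would also satisfy the conditions, at the price of a larger constant in the regret.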

In RUCB, to ensure logarithmic regret order, the policy parameters L and D need to be
chosen appropriately. This requires an arbitrary (nontrivial) bound on $s_{\max}$, $|\mathcal{S}|_{\max}$, $\epsilon_{\min}$, and
$\mu^* - \mu_{\sigma(M+1)}$. In the case where these bounds are unavailable, D and L can be chosen to increase
with time to achieve a regret order arbitrarily close to logarithmic order. This is formally stated
in the following theorem.

Theorem 2: Assume all arms are modeled as finite state, irreducible, aperiodic, and reversible
Markov chains. For any increasing sequence $f(t)$ ($f(t) \to \infty$ as $t \to \infty$), if $L(t)$ and $D(t)$ are
chosen such that $L(t) \to \infty$ as $t \to \infty$, $f(t)/D(t) \to \infty$ as $t \to \infty$, and $D(t)/L(t) \to \infty$ as $t \to \infty$, then
we have

$$r_\Phi(t) \sim o(f(t) \log(t)). \qquad (8)$$

Proof: See Appendix B for details.


V.  Conclusion 

In  this  paper,  we  considered  the  non-Bayesian  restless  multi-armed  bandit  problem.  We  adopted 
the  definition  of  regret  from  the  classic  MAB  and  developed  a  policy  that  achieves  logarithmic 
regret  when  an  arbitrary  (nontrivial)  bound  on  certain  system  parameters  is  known.  When  no 
knowledge  about  the  system  is  available,  we  extend  the  RUCB  policy  to  achieve  a  regret  with 
an  order  arbitrarily  close  to  logarithmic. 

Appendix A. Proof of Theorem 1

We first rewrite the definition of regret as

$$r_\Phi(t) = t\mu^* - E_\Phi R(t) = \sum_{i=1}^{N} \left[ \mu_i E[T_i(t)] - E\left[ \sum_{n=1}^{T_i(t)} s_i(t_i(n)) \right] \right] + \sum_{i=1}^{N} (\mu^* - \mu_i) E[T_i(t)]. \qquad (9)$$

To  show  that  the  regret  has  a  logarithmic  order,  it  is  sufficient  to  show  that  both  terms  in  (9) 
have  logarithmic  orders.  The  first  term  in  (9)  can  be  understood  as  the  regret  caused  by  arm 
switching.  The  second  term  can  be  understood  as  the  regret  caused  by  engaging  a  bad  arm.  First, 
we  bound  the  regret  caused  by  arm  switching  based  on  the  following  lemma. 

Lemma 1 [3]: Consider an irreducible, aperiodic Markov chain with state space $\mathcal{S}$, matrix of
transition probabilities P, an initial distribution $\vec{q}$ which is positive in all states, and stationary
distribution $\vec{\pi}$ ($\pi_s$ is the stationary probability of state s). The state (reward) at time t is denoted
by s(t). Let $\mu$ denote the mean reward. If we play the chain for an arbitrary time T, then there
exists a value $A_P \le (\min_{s \in \mathcal{S}} \pi_s)^{-1} \sum_{s \in \mathcal{S}} s$ such that $E[\sum_{t=1}^{T} s(t) - \mu T] \le A_P$.

Lemma 1 shows that if the player continues to play one arm for time T, the difference between
the expected reward and $T\mu$ can be bounded by a constant that is independent of T. This constant
is an upper bound on the regret caused by each arm switching. If there are only logarithmically
many arm switchings as time goes on, the regret caused by arm switching has a logarithmic order.
An  upper  bound  on  the  number  of  arm  switchings  is  shown  below.  It  is  developed  by  bounding 
the  numbers  of  the  exploration  epochs  and  the  exploitation  epochs  respectively. 

For the exploration epochs, by time t, if the player has begun to play the (n + 1)th exploration
epoch, we have

$$\tfrac{1}{3}(4^n - 1) \le D \ln t, \qquad (10)$$

where $\tfrac{1}{3}(4^n - 1)$ is the time spent on each arm in the first n exploration epochs.

Consequently the number of the exploration epochs can be bounded by

$$n_O(t) \le \lfloor \log_4(3D \ln t + 1) \rfloor + 1. \qquad (11)$$

By time t, at most (t - N) time slots have been spent on the exploitation epochs. Thus

$$n_I(t) \le \left\lceil \log_4\left(\tfrac{3}{2}(t - N) + 1\right) \right\rceil. \qquad (12)$$
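
The two epoch-count bounds can be checked numerically on the deterministic epoch schedule, which does not depend on the observed rewards. The sketch below is an illustrative check under the assumption that exploration epoch n lasts N x 4^(n-1) slots, exploitation epoch n lasts 2 x 4^(n-1) slots, and the bounds read n_O(t) <= floor(log_4(3 D ln t + 1)) + 1 and n_I(t) <= ceil(log_4((3/2)(t - N) + 1)). All function names are hypothetical.

```python
import math


def epoch_schedule(num_arms, D, horizon):
    """Return (t, n_O, n_I) after each epoch, where t counts elapsed slots."""
    t = n_O = n_I = 0
    checkpoints = []
    while t < horizon:
        # The decision is taken at the next slot, whose index is t + 1.
        if n_O > 0 and (4 ** n_O - 1) / 3 > D * math.log(t + 1):
            t += 2 * 4 ** n_I          # exploitation epoch
            n_I += 1
        else:
            t += num_arms * 4 ** n_O   # exploration epoch
            n_O += 1
        checkpoints.append((t, n_O, n_I))
    return checkpoints


def exploration_epoch_bound(D, t):
    """Right-hand side of the bound on the number of exploration epochs."""
    return math.floor(math.log(3 * D * math.log(t) + 1, 4)) + 1


def exploitation_epoch_bound(num_arms, t):
    """Right-hand side of the bound on the number of exploitation epochs."""
    return math.ceil(math.log(1.5 * (t - num_arms) + 1, 4))
```

Both counts grow logarithmically in t, which is what limits the number of arm switchings.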
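
Hence a logarithmic upper bound on the first term in (9) is

$$\sum_{i=1}^{N} \left[ \mu_i E[T_i(t)] - E\left[ \sum_{n=1}^{T_i(t)} s_i(t_i(n)) \right] \right] \le \left( \left\lceil \log_4\left(\tfrac{3}{2}(t - N) + 1\right) \right\rceil + N \left( \lfloor \log_4(3D \ln t + 1) \rfloor + 1 \right) \right) \max_i A_i, \qquad (13)$$

where $A_i = (\min_{s \in \mathcal{S}_i} \pi^i_s)^{-1} \sum_{s \in \mathcal{S}_i} s$.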

Next we show that the second term of (9) has a logarithmic order. The approach here is to
show that for every bad arm i, $E[T_i(t)]$ has a logarithmic order. Let $T_{i,O}(t)$ denote the time spent
on arm i in the exploration epochs by time t. Let $T_{i,I}(t)$ denote the time spent on arm i in the
exploitation epochs by time t. So we have

$$T_i(t) = T_{i,O}(t) + T_{i,I}(t). \qquad (14)$$

We will show that both $E[T_{i,O}(t)]$ and $E[T_{i,I}(t)]$ have logarithmic orders.

The logarithmic order of $E[T_{i,O}(t)]$ follows directly from (11), i.e.,

$$T_{i,O}(t) \le \tfrac{1}{3} \left[ 4(3D \ln t + 1) - 1 \right]. \qquad (15)$$

The logarithmic order of $E[T_{i,I}(t)]$ is established by bounding $\Pr[i, n]$, the probability that
arm i is played in the nth exploitation epoch.

Recall that if arm i is selected in the nth exploitation epoch, it will be played $2 \times 4^{n-1}$
times. From the upper bound on the number of the exploitation epochs given in (12), we thus
have

$$E[T_{i,I}(t)] \le \sum_{n=1}^{\lceil \log_4(\frac{3}{2}(t - N) + 1) \rceil} 2 \times 4^{n-1} \Pr[i, n] \qquad (16)$$

$$\le \sum_{n=1}^{\lceil \log_4(\frac{3}{2}(t - N) + 1) \rceil} 3 t_n \Pr[i, n], \qquad (17)$$

where $t_n$ denotes the starting time of the nth exploitation epoch and (17) follows from the fact
that $t_n \ge \tfrac{2}{3} 4^{n-1}$. Notice that (17) has only logarithmically many terms. If each term can be
bounded by a fixed constant, i.e., if $\Pr[i, n]$ has an order of $t_n^{-1}$, then the sum has a logarithmic
order.

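Let $C_{t,w} = \sqrt{L \ln t / w}$ denote the second part of the RUCB index. If arm i is played in the
nth exploitation epoch, then there exist $1 \le w < t_n$ and $D \ln t_n \le w_i < t_n$ such that

$$\bar{s}^*(t_n) + C_{t_n,w} \le \bar{s}_i(t_n) + C_{t_n,w_i}. \qquad (18)$$

We thus have

$$\Pr[i, n] \le \sum_{w=1}^{t_n - 1} \sum_{w_i = D \ln t_n}^{t_n - 1} \Pr[\bar{s}^*(t_n) + C_{t_n,w} \le \bar{s}_i(t_n) + C_{t_n,w_i}] \qquad (19)$$

$$\le \sum_{w=1}^{t_n - 1} \sum_{w_i = D \ln t_n}^{t_n - 1} \left( \Pr[\bar{s}^*(t_n) \le \mu^* - C_{t_n,w}] + \Pr[\bar{s}_i(t_n) \ge \mu_i + C_{t_n,w_i}] + \Pr[\mu^* < \mu_i + 2C_{t_n,w_i}] \right) \qquad (20)$$

$$= \sum_{w=1}^{t_n - 1} \sum_{w_i = D \ln t_n}^{t_n - 1} \left( \Pr[\bar{s}^*(t_n) \le \mu^* - C_{t_n,w}] + \Pr[\bar{s}_i(t_n) \ge \mu_i + C_{t_n,w_i}] \right), \qquad (21)$$

where (21) follows from the fact that $w_i \ge D \ln t_n$.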
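
The step from (20) to (21) can be spelled out as a short worked inequality. The derivation below is a sketch that assumes condition (6) reads $D \ge 4L / (\mu^* - \mu_{\sigma(M+1)})^2$ and that arm i is suboptimal, so that $\mu^* - \mu_i \ge \mu^* - \mu_{\sigma(M+1)}$.

```latex
\[
2C_{t_n,w_i}
  = 2\sqrt{\frac{L \ln t_n}{w_i}}
  \le 2\sqrt{\frac{L \ln t_n}{D \ln t_n}}
  = 2\sqrt{\frac{L}{D}}
  \le \mu^* - \mu_{\sigma(M+1)}
  \le \mu^* - \mu_i .
\]
```

Under these assumptions the event $\mu^* < \mu_i + 2C_{t_n,w_i}$ cannot occur, so its probability is zero and the third term of (20) drops out.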

Next we bound $\Pr[\bar{s}_i(t_n) \ge \mu_i + C_{t_n,w_i}]$ and $\Pr[\bar{s}^*(t_n) \le \mu^* - C_{t_n,w}]$. The event
$\bar{s}_i(t_n) \ge \mu_i + C_{t_n,w_i}$ is equivalent to

$$w_i \bar{s}_i(t_n) \ge w_i \mu_i + \sqrt{L w_i \ln t_n}. \qquad (22)$$

The inequality (22) is the event that the sample mean from multiple epochs for arm i is too
high. This event implies that the sample mean from at least one epoch is significantly higher
than the true mean. Notice that the tolerated deviation in (22) is of the form $\sqrt{L w_i \ln t_n}$. It is
convenient if the tolerated deviation for each epoch is of the form $C\sqrt{L w \ln t_n}$, where w is the
number of plays on one arm in one epoch and C is a constant independent of w. In this
way, the tolerated deviations for the sample mean in each epoch and in all the epochs are of
similar forms. The possible values for the number of plays in the exploitation epochs are $2 \times 4^n$.
The possible values of the number of plays on an arm in the exploration epochs are $4^n$.
Consequently it can be assumed that the player has spent time $w_i$ on arm i by playing epochs
with lengths $2^{n_1 - 1}, 2^{n_2 - 1}, \cdots, 2^{n_K - 1}$, with each $n_j$ distinct. Thus

$$w_i = \sum_{j=1}^{K} 2^{n_j - 1}. \qquad (23)$$
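
The tolerated deviation for a continuous period of play with length $2^{n_j - 1}$ is $(\sqrt{2} - 1)\sqrt{L \ln t_n 2^{n_j - 1}}$.
Let $R_i(w)$ denote the reward gained from arm i in a period with length w. An upper bound on
$\Pr[w_i \bar{s}_i(t_n) \ge w_i \mu_i + \sqrt{L \ln t_n w_i}]$ is derived below:

$$\begin{aligned}
& \Pr[w_i \bar{s}_i(t_n) \ge w_i \mu_i + \sqrt{L \ln t_n w_i}] \\
& \le \sum_{j=1}^{K} \Pr\left[ R_i(2^{n_j - 1}) \ge \mu_i \cdot 2^{n_j - 1} + \sqrt{L \ln t_n} \left( \sqrt{\sum_{k=1}^{j} 2^{n_k - 1}} - \sqrt{\sum_{k=1}^{j-1} 2^{n_k - 1}} \right) \right] \\
& \le \sum_{j=1}^{K} \Pr\left[ R_i(2^{n_j - 1}) \ge \mu_i \cdot 2^{n_j - 1} + (\sqrt{2} - 1)\sqrt{2^{n_j - 1} L \ln t_n} \right]. \qquad (24)
\end{aligned}$$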
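
One way to see where the factor $\sqrt{2} - 1$ can come from is to split $\sqrt{w_i}$ into telescoping shares, one per period, with the periods sorted from shortest to longest. Since the period lengths are distinct powers of two, the total length of all shorter periods is smaller than the current period, and each share is then at least $(\sqrt{2} - 1)$ times the square root of the period length. The sketch below checks this numerically; it is an illustrative reading of the step, and the function name is hypothetical.

```python
import math


def deviation_shares(exponents):
    """Split sqrt(sum of 2**e) into telescoping shares, one per period.

    `exponents` are distinct non-negative integers; periods are processed
    from shortest to longest. Returns a list of (length, share) pairs where
    share = sqrt(partial + length) - sqrt(partial).
    """
    shares = []
    partial = 0
    for e in sorted(exponents):
        length = 2 ** e
        shares.append((length, math.sqrt(partial + length) - math.sqrt(partial)))
        partial += length
    return shares
```

The shares add up to the square root of the total number of plays, so if the overall deviation exceeds its tolerance, at least one period must exceed its own share.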

The probability $\Pr[R_i(2^{n_j - 1}) \ge \mu_i \cdot 2^{n_j - 1} + (\sqrt{2} - 1)\sqrt{2^{n_j - 1} L \ln t_n}]$ is for the event that the
sum of rewards during a period of length $2^{n_j - 1}$ from arm i deviates significantly from
$\mu_i 2^{n_j - 1}$. It can be written in terms of the numbers of occurrences of states. Specifically, let
$O^i_s(w)$ denote the number of occurrences of state s from arm i in a period with length w; we
have

$$\Pr\left[ R_i(2^{n_j - 1}) \ge \mu_i \cdot 2^{n_j - 1} + (\sqrt{2} - 1)\sqrt{2^{n_j - 1} L \ln t_n} \right]
= \Pr\left[ \sum_{s \in \mathcal{S}_i} \left( -s O^i_s(2^{n_j - 1}) + s 2^{n_j - 1} \pi^i_s \right) \le -(\sqrt{2} - 1)\sqrt{2^{n_j - 1} L \ln t_n} \right]. \qquad (25)$$

The above equality leads to

$R_i(2^{n_j - 1}) \ge \mu_i \cdot 2^{n_j - 1} + (\sqrt{2} - 1)\sqrt{2^{n_j - 1} L \ln t_n}$ implies that

$$-O^i_s(2^{n_j - 1}) + 2^{n_j - 1} \pi^i_s \le -(\sqrt{2} - 1)\sqrt{2^{n_j - 1} L \ln t_n} / (s|\mathcal{S}_i|) \text{ for some } s \in \mathcal{S}_i. \qquad (26)$$

Thus the event that the sample mean deviates significantly from the true mean implies that
at least one state occurs much more often than predicted by its stationary probability.

Lemma 2 below is used to bound the probability that a state occurs much more often than predicted
by its stationary probability.

Lemma 2 (Chernoff Bound, Theorem 2.1 in [16]): Consider a finite state, irreducible, aperiodic
and reversible Markov chain with state space $\mathcal{S}$, matrix of transition probabilities P, and an initial
distribution $\vec{q}$. Let $N_q = \|(q_x / \sqrt{\pi_x}), x \in \mathcal{S}\|_2$. Let $\epsilon = 1 - \lambda_2$, where $\lambda_2$ is the second largest eigenvalue
of the matrix P. $\epsilon$ will be referred to as the eigenvalue gap. Let $A \subset \mathcal{S}$. Let $T_A(t)$ be the number
of times that states in the set A are visited up to time t. Then for any $\gamma \ge 0$, we have

$$\Pr(T_A(t) - t\pi_A \ge \gamma) \le \left( 1 + \frac{\gamma \epsilon}{10 t} \right) N_q e^{-\gamma^2 \epsilon / (20 t)}. \qquad (27)$$
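
To get a feel for how quickly the bound of Lemma 2 decays, the right-hand side of (27) can be evaluated directly. The sketch below assumes that (27) has the form $(1 + \gamma\epsilon/(10t)) N_q e^{-\gamma^2 \epsilon/(20t)}$ and that $N_q$ is the Euclidean norm of the vector with entries $q_x / \sqrt{\pi_x}$. The function names are hypothetical.

```python
import math


def initial_distribution_norm(q, pi):
    """N_q: Euclidean norm of the vector with entries q_x / sqrt(pi_x)."""
    return math.sqrt(sum((qx / math.sqrt(px)) ** 2 for qx, px in zip(q, pi)))


def chernoff_bound(gamma, t, eps, N_q):
    """Right-hand side of (27) for deviation gamma over t steps,
    eigenvalue gap eps, and initial-distribution norm N_q."""
    return (1 + gamma * eps / (10 * t)) * N_q * math.exp(-gamma ** 2 * eps / (20 * t))
```

When the chain starts from its stationary distribution, $N_q$ equals one, and a deviation $\gamma$ proportional to $\sqrt{t \ln t_n}$ turns the exponential factor into a negative power of $t_n$, which is how the lemma is used in the remainder of the proof.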
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Using  Lemma  2,  we  have 

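\begin{align}
&\Pr\Big[R_i(2^{n_j-1}) > \mu_i \cdot 2^{n_j-1} + (\sqrt{2}-1)\sqrt{2^{n_j-1} L \ln t_n}\Big] \tag{28}\\
&\le \sum_{s\in\mathcal{S}_i} \Pr\Big[-O^i_s(2^{n_j-1}) + 2^{n_j-1}\pi^i_s < -\frac{(\sqrt{2}-1)\sqrt{2^{n_j-1} L \ln t_n}}{s\,|\mathcal{S}_i|}\Big] \tag{29}\\
&= \sum_{s\in\mathcal{S}_i} \Pr\Big[O^i_s(2^{n_j-1}) - 2^{n_j-1}\pi^i_s > \frac{(\sqrt{2}-1)\sqrt{2^{n_j-1} L \ln t_n}}{s\,|\mathcal{S}_i|}\Big] \tag{30}\\
&\le \sum_{s\in\mathcal{S}_i} \Big(1 + \frac{(\sqrt{2}-1)\epsilon_i}{10\,s\,|\mathcal{S}_i|}\sqrt{\frac{L \ln t_n}{2^{n_j-1}}}\Big) N_q\, t_n^{-(3-2\sqrt{2})L\epsilon_i/(20 s^2 |\mathcal{S}_i|^2)} \tag{31}\\
&\le \frac{|\mathcal{S}_i|}{\pi_{\min}}\Big(1 + \frac{\epsilon_{\max}\sqrt{L}}{10\,s_{\min}}\Big)\, t_n^{-\frac{(3-2\sqrt{2})(L\epsilon_{\min} - 10 s_{\max}^2)}{20 s_{\max}^2 |\mathcal{S}|_{\max}^2}}. \tag{32}
\end{align}

Since $L \ge \frac{1}{\epsilon_{\min}}\Big(\frac{20 s_{\max}^2 |\mathcal{S}|_{\max}^2}{3-2\sqrt{2}} + 10 s_{\max}^2\Big)$ and $K < t_n$ in (23), we have

\[
\Pr\big[\bar{s}_i(t_n) > \mu_i + C_{t_n,w}\big] \le \frac{|\mathcal{S}_i|}{\pi_{\min}}\Big(1 + \frac{\epsilon_{\max}\sqrt{L}}{10\,s_{\min}}\Big)\, t_n^{-1}. \tag{33}
\]

Similarly, it can be shown that

\[
\Pr\big[\bar{s}_*(t_n) < \mu^* - C_{t_n,w}\big] \le \frac{|\mathcal{S}_*|}{\pi_{\min}}\Big(1 + \frac{\epsilon_{\max}\sqrt{L}}{10\,s_{\min}}\Big)\, t_n^{-1}. \tag{34}
\]

Consequently, the expected time spent on a bad arm $i$ in the exploitation epochs is at most

\[
\Big\lceil \log_4\Big(\frac{3}{2}(t-N) + 1\Big) \Big\rceil\, \frac{3\big(|\mathcal{S}_i| + |\mathcal{S}_*|\big)}{\pi_{\min}}\Big(1 + \frac{\epsilon_{\max}\sqrt{L}}{10\,s_{\min}}\Big). \tag{35}
\]

Combining (9), (13), (14), (15), and (35), we obtain the upper bound on the regret:

\begin{align}
r_{\Phi}(t) \le{}& \Big\lceil \log_4\Big(\frac{3}{2}(t-N) + 1\Big) \Big\rceil \max_i A_i
+ N\big(\lfloor \log_4(3D \ln t + 1) \rfloor + 1\big) \max_i A_i \nonumber\\
&+ \sum_i (\mu^* - \mu_i) \Big\lceil \log_4\Big(\frac{3}{2}(t-N) + 1\Big) \Big\rceil\, \frac{3\big(|\mathcal{S}_i| + |\mathcal{S}_*|\big)}{\pi_{\min}}\Big(1 + \frac{\epsilon_{\max}\sqrt{L}}{10\,s_{\min}}\Big) \nonumber\\
&+ \sum_i (\mu^* - \mu_i)\, \frac{1}{3}\big[4(3D \ln t + 1) - 1\big]. \tag{36}
\end{align}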
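For reference, the polynomial decay in $t_n$ that appears when Lemma 2 is applied to each state's occurrence count follows from a direct substitution. The lines below are a worked sketch under stated assumptions: the lemma is applied with $A = \{s\}$ and $t = 2^{n_j-1}$, $\epsilon_i$ denotes the eigenvalue gap of arm $i$, and $\gamma$ is set to the deviation threshold from the preceding display.

```latex
\begin{align*}
\gamma &= \frac{(\sqrt{2}-1)\sqrt{2^{n_j-1} L \ln t_n}}{s\,|\mathcal{S}_i|}, \\
\frac{\gamma^2 \epsilon_i}{20 \cdot 2^{n_j-1}}
  &= \frac{(\sqrt{2}-1)^2 L \epsilon_i \ln t_n}{20\, s^2 |\mathcal{S}_i|^2}
   = \frac{(3-2\sqrt{2}) L \epsilon_i}{20\, s^2 |\mathcal{S}_i|^2}\, \ln t_n, \\
e^{-\gamma^2 \epsilon_i / (20 \cdot 2^{n_j-1})}
  &= t_n^{-(3-2\sqrt{2}) L \epsilon_i / (20\, s^2 |\mathcal{S}_i|^2)}, \\
\frac{\gamma \epsilon_i}{10 \cdot 2^{n_j-1}}
  &= \frac{(\sqrt{2}-1)\epsilon_i}{10\, s\, |\mathcal{S}_i|} \sqrt{\frac{L \ln t_n}{2^{n_j-1}}}.
\end{align*}
```

The factor $3 - 2\sqrt{2}$ is simply $(\sqrt{2}-1)^2$, which is why it also appears in the lower bound required of $L$.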


We point out that the same Chernoff bound given in Lemma 2 is also used in [6] to handle the rested Markovian reward MAB problem. Note that the Chernoff bound in [16] requires that all the observations used in calculating the sample means ($\bar{s}_i$ and $\bar{s}_*$ in (21)) are from a continuously evolving Markov process. This condition is naturally satisfied in the rested MAB problem. However, for the restless MAB problem considered here, the sample means are calculated using observations from multiple epochs, which are noncontiguous segments of the Markovian sample path. As detailed in the above proof, the desired bound on the probabilities of the events in (21) is ensured by the carefully chosen (growing) lengths of the exploration and exploitation epochs.


Appendix  B.  Proof  of  Theorem  2 

The choice of $L(t)$ and $D(t)$ implies that $D(t) \to \infty$ as $t \to \infty$. By the same reasoning as in the proof of Theorem 1, the regret has three parts: the regret caused by arm switching, the regret caused by playing bad arms in the exploration epochs, and the regret caused by playing bad arms in the exploitation epochs. It will be shown that each part of the regret is of a lower order than $f(t)\log(t)$.

The number of arm switchings is upper bounded by $N \log_2(t/N + 1)$. So the regret caused by arm switching is upper bounded by

\[
N \log_2(t/N + 1) \max_i A_i, \tag{37}
\]

where $A_i = \big(\min_{s\in\mathcal{S}_i} \pi^i_s\big)^{-1} \sum_{s\in\mathcal{S}_i} s$. Since $f(t) \to \infty$ as $t \to \infty$, we have

\[
\lim_{t\to\infty} \frac{N \log_2(t/N + 1) \max_i A_i}{f(t)\log(t)} = 0. \tag{38}
\]

Thus the regret caused by arm switching is of a lower order than $f(t)\log(t)$.
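The limit for the arm-switching term can be verified with an elementary bound. The following is a short worked version, assuming $t \ge 1$, $N \ge 1$, and the natural logarithm in the denominator (another base changes only a constant factor).

```latex
\begin{align*}
\log_2\!\Big(\frac{t}{N} + 1\Big) &\le \log_2(2t) = 1 + \frac{\ln t}{\ln 2}, \\
\frac{N \log_2(t/N + 1) \max_i A_i}{f(t) \ln t}
  &\le N \max_i A_i \Big(\frac{1}{f(t)\ln t} + \frac{1}{f(t)\ln 2}\Big)
  \xrightarrow[t\to\infty]{} 0.
\end{align*}
```

Both terms in the parentheses vanish because $f(t) \to \infty$, so no condition on how fast $f$ grows is needed for this part of the regret.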

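The regret caused by playing bad arms in the exploration epochs is bounded by

\[
\sum_i (\mu^* - \mu_i)\, \frac{1}{3}\big[4(3D(t) \ln t + 1) - 1\big]. \tag{39}
\]

Since $f(t)/D(t) \to \infty$ as $t \to \infty$, we have

\[
\lim_{t\to\infty} \frac{\sum_i (\mu^* - \mu_i)\, \frac{1}{3}\big[4(3D(t) \ln t + 1) - 1\big]}{f(t)\log(t)} = 0. \tag{40}
\]

Thus the regret caused by playing bad arms in the exploration epochs is of a lower order than $f(t)\log(t)$.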
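The vanishing ratio for the exploration term can be illustrated numerically. The sketch below rests on assumed choices that are not from the report: $f(t) = \ln t$ and $D(t) = \ln\ln t$, which satisfy $f(t)/D(t) \to \infty$, and the sum of reward gaps is normalized to 1 (`gap_sum`). It evaluates the ratio of the exploration bound to $f(t)\ln t$ at a few widely spaced times.

```python
import math

def exploration_regret_bound(t, D, gap_sum):
    """gap_sum * (1/3) * [4 * (3 * D(t) * ln t + 1) - 1]."""
    return gap_sum * (4.0 * (3.0 * D(t) * math.log(t) + 1.0) - 1.0) / 3.0

def f(t):
    """Illustrative target growth rate (assumption)."""
    return math.log(t)

def D(t):
    """Illustrative exploration parameter with f(t)/D(t) -> infinity (assumption)."""
    return math.log(math.log(t))

gap_sum = 1.0  # normalized sum of (mu* - mu_i) over the bad arms
times = [10.0 ** k for k in (2, 4, 8, 16)]
ratios = [exploration_regret_bound(t, D, gap_sum) / (f(t) * math.log(t)) for t in times]
```

With these choices the ratio behaves like $4D(t)/f(t)$, so it decreases slowly. A slowly growing $f$ forces an even more slowly growing $D$.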

For the regret caused by playing bad arms in the exploitation epochs, it is shown below that the time spent on a bad arm $i$ can be bounded by a constant independent of $t$.

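Since $D(t)/L(t) \to \infty$ as $t \to \infty$, there exists a time $t_3$ such that $\forall t > t_3$, $D(t)$ satisfies the lower bound on $D$ required in Theorem 1 with $L$ replaced by $L(t)$. There also exists a time $t_4$ such that $\forall t > t_4$, $L(t) \ge \frac{1}{\epsilon_{\min}}\Big(\frac{20 s_{\max}^2 |\mathcal{S}|_{\max}^2}{3-2\sqrt{2}} + 10 s_{\max}^2\Big)$. The time spent on playing bad arms before $t_5 = \max(t_3, t_4)$ is at most $t_5$, and the caused regret is at most $(\mu^* - \mu_{\sigma(N)})\,t_5$. After $t_5$, the time spent on each bad arm $i$ is upper bounded by

\[
\frac{3\big(|\mathcal{S}_i| + |\mathcal{S}_*|\big)}{\pi_{\min}}\Big(1 + \frac{\epsilon_{\max}\sqrt{L(t_5)}}{10\,s_{\min}}\Big). \tag{41}
\]

An upper bound for the corresponding regret is

\[
\sum_i (\mu^* - \mu_i)\, \frac{3\big(|\mathcal{S}_i| + |\mathcal{S}_*|\big)}{\pi_{\min}}\Big(1 + \frac{\epsilon_{\max}\sqrt{L(t_5)}}{10\,s_{\min}}\Big). \tag{42}
\]

So the regret caused by playing bad arms in the exploitation epochs is at most

\[
(\mu^* - \mu_{\sigma(N)})\,t_5 + \sum_i (\mu^* - \mu_i)\, \frac{3\big(|\mathcal{S}_i| + |\mathcal{S}_*|\big)}{\pi_{\min}}\Big(1 + \frac{\epsilon_{\max}\sqrt{L(t_5)}}{10\,s_{\min}}\Big), \tag{43}
\]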


which is a constant independent of time $t$. Thus the regret caused by playing bad arms in the exploitation epochs is of a lower order than $f(t)\log(t)$.

Because each part of the regret is of a lower order than $f(t)\log(t)$, the total regret is also of a lower order than $f(t)\log(t)$.